FINAL PROJECT-MACHINE LEARNING

Libraries And Importing Data Set

Exploratory Analysis

Exploratory analysis is a term used in data analysis for exploring the data and extracting useful information from it.
Exploratory analysis is carried out once the data has been collected, cleaned, and processed. By manipulating the data, you find the exact information you need to carry out the analysis, or discover that you need more.
During this step, various Python techniques (such as functions and plots) can be used to understand the data, so that it can be interpreted more effectively and better conclusions can be drawn according to the requirements.

Merging both dataframes

Checking the new column:
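As a minimal sketch (using small hypothetical frames standing in for the project's data files), the two dataframes can be stacked with `pd.concat`, adding a marker column first so the new column records which file each row came from:

```python
import pandas as pd

# Two small hypothetical dataframes standing in for the project's files
df1 = pd.DataFrame({"adr": [100.0, 85.5]})
df2 = pd.DataFrame({"adr": [120.0]})

# Tag each frame before merging so the origin survives the concat
df1["source"] = "file1"
df2["source"] = "file2"

# Stack the rows and rebuild a clean 0..n-1 index
merged = pd.concat([df1, df2], ignore_index=True)
print(merged.shape)  # (3, 2)
```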

Dimensionality of the Model

Another way to check the dimensionality of the model:
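Both checks can be done with pandas; a sketch on a tiny hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # hypothetical frame

# .shape returns (rows, columns) in one call
print(df.shape)  # (3, 2)

# An equivalent check using len() on the rows and on the column index
print(len(df), len(df.columns))  # 3 2
```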

Exploring columns

Checking columns names in the data file:

Exploring the type of columns:

Information about the data:

Summary of data

Booking cancellation

Checking how many bookings were cancelled:

In this pie chart we can see the percentage of bookings that were cancelled and bookings that were not:
1 indicates a cancelled booking
0 indicates a booking that was not cancelled
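The counts and percentages behind such a pie chart can be obtained with `value_counts`; a sketch on a hypothetical label column:

```python
import pandas as pd

# Hypothetical label column: 1 = cancelled, 0 = not cancelled
is_canceled = pd.Series([0, 1, 0, 0, 1, 0])

counts = is_canceled.value_counts()                     # raw counts per label
share = is_canceled.value_counts(normalize=True) * 100  # percentages for the pie
print(counts.to_dict())  # {0: 4, 1: 2}
```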

Order types

Checking the number of orders of each type:

In this pie chart we see the percentage of each order type:
We can see that most people order via "online TA"

Countries with the highest bookings

Checking which countries' citizens book the most:

Top 10 countries:

Bar plot showing the number of citizens from each country:

Pie plot showing the number of citizens from each country:

Countries with the most guests

Creating a new feature that gives us the total number of guests (adults + children + babies):
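A sketch of the feature, assuming the guest columns are named `adults`, `children`, and `babies` as in the hotel-bookings data:

```python
import pandas as pd

# Hypothetical rows; the real frame comes from the merged bookings data
df = pd.DataFrame({"adults": [2, 1], "children": [1, 0], "babies": [0, 1]})

# The new feature is simply the row-wise sum of the three guest columns
df["total_guests"] = df["adults"] + df["children"] + df["babies"]
print(df["total_guests"].tolist())  # [3, 2]
```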

Showing the top 5 countries with guests that didn't cancel their booking:

A world map showing the number of guests per country:
Yellow indicates the largest number of guests

Month with the most orders

Bar plot showing the number of orders that were not cancelled in each month.
We can see that August and July have the most orders; since that is the summer vacation, most families prefer to travel then.

We can see in this bar plot that the highest ADR is in August, and we can also notice that most cancellations happen in the same month, so we can conclude that the ADR is the reason for that.

Type of customers that cancel their booking

Crosstab showing the number of cancelled bookings per customer type:
We can see that transient customers cancel the most

Catplot showing cancellations vs. non-cancellations by customer type:

Deposit type effect on cancellation

Crosstab showing the number of cancelled bookings per deposit type:
We can see that most of the cancellations are of bookings with no deposit

Catplot showing cancellations vs. non-cancellations by deposit type:

Percentage of cancellation each year

Barplot showing the average cancellation rate by year:

Density Curve of Time until Order, by Cancellation

Feature Histogram

Preprocessing

Preliminary processing of the data in order to prepare it for the main processing or for further analysis.

Correlation between features before handling the data

Another Way for Checking Correlation

Changing bool into numeric

Changing the type of the "Cancelation" column from bool to int by mapping "True" to 1 and "False" to 0:
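A minimal sketch of the conversion with `astype` (the column name follows the notebook; the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Cancelation": [True, False, True]})  # hypothetical values

# astype(int) maps True -> 1 and False -> 0
df["Cancelation"] = df["Cancelation"].astype(int)
print(df["Cancelation"].tolist())  # [1, 0, 1]
```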

Handling null values

Handling null values using interpolation:
Interpolation handles both object and numeric values easily.
(Interpolation is explained in the report.)

Two interpolation methods will be used
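A sketch of the two fills on hypothetical columns: linear interpolation for a numeric column, and a forward fill (a "pad"-style interpolation) that also works for object columns:

```python
import numpy as np
import pandas as pd

# Numeric column with gaps: each NaN is filled from its neighbours
num = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])
print(num.interpolate(method="linear").tolist())  # [1.0, 2.0, 3.0, 5.0, 7.0]

# Object column: forward-fill carries the last seen value over the gap
obj = pd.Series(["A", None, "B"])
print(obj.ffill().tolist())  # ['A', 'A', 'B']
```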

Handling infinite values

Convert to Categorical Variables

We will use the pandas categorical label encoding to convert them easily:
(also explained in the report)
The categorical type is a form of factorization, meaning that each unique value or category is given an incrementing integer value starting from zero.
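A sketch of the factorization on a hypothetical country column; with `cat.codes`, each unique category gets an integer code starting from zero (string categories are ordered alphabetically):

```python
import pandas as pd

s = pd.Series(["PRT", "GBR", "PRT", "FRA"])  # hypothetical country column

# Categories are sorted (FRA=0, GBR=1, PRT=2); codes replace the strings
codes = s.astype("category").cat.codes
print(codes.tolist())  # [2, 1, 2, 0]
```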

Handling Date Column

Combining the date columns into one date column:
The datetime function selects specific columns such as year, month, and day, so the predefined column names will be changed:

A few of the rows have wrong date entries: for example, June has 30 days but there are entries for June 31, so we will drop these entries:

Now the months have the correct number of days
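One way to drop the impossible dates (a sketch with hypothetical year/month/day columns): `pd.to_datetime(..., errors="coerce")` turns an invalid combination such as June 31 into `NaT`, which can then be dropped:

```python
import pandas as pd

# Hypothetical split date columns, including an impossible June 31 entry
df = pd.DataFrame({"year": [2017, 2017], "month": [6, 6], "day": [15, 31]})

# Invalid combinations become NaT instead of raising, then get dropped
df["date"] = pd.to_datetime(df[["year", "month", "day"]], errors="coerce")
df = df.dropna(subset=["date"])
print(len(df))  # 1
```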

Creating Date Object:

The model can't train on dates, so to handle this each date will be converted into an index:

Outliers

Using Z-score to find the outliers

Removing Outliers:
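A sketch of the z-score filter on a hypothetical column (computed directly with pandas/NumPy; `scipy.stats.zscore` gives the same values):

```python
import numpy as np
import pandas as pd

# Hypothetical column: nineteen typical values and one extreme outlier
s = pd.Series([10.0] * 19 + [1000.0])

# Z-score: distance from the mean in (population) standard deviations
z = (s - s.mean()) / s.std(ddof=0)

# Keep only the rows within 3 standard deviations of the mean
cleaned = s[np.abs(z) < 3]
print(len(cleaned))  # 19
```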

Checking The New Dataset without Outliers:

Correlation between features and Removing Features

Applying Pearson correlation to find the highly correlated features

The correlation coefficient takes values between -1 and 1:

Highly correlated variables tend to carry similar information, which tends to bring down the performance of the model, so highly correlated features will be removed from the model
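A common sketch for this step (on a hypothetical frame where `b` is an exact linear copy of `a`): compute the absolute correlation matrix, scan its upper triangle, and drop one column from every pair above a threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" is perfectly correlated with "a", "c" is not
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})

corr = df.corr().abs()

# Upper triangle only, so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

df = df.drop(columns=to_drop)
print(to_drop)  # ['b']
```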

Removing features

The following features were removed from the dataset because of their high correlation:

Separate dataset and labels

Dimensionality Reduction and Feature Selection

In machine learning, the performance of a model only benefits from more features up to a certain point. The more features are fed into a model, the more the dimensionality of the data increases, and as the dimensionality increases, overfitting becomes more likely.
The dimensionality problem usually occurs when there is a high number of features, as they can directly affect the model's predictions.
Here the dataset consists of only 35 features and is not high-dimensional; in addition, by using feature importance only the important features are used to train the model, so dimensionality reduction is not required for this dataset.

Feature Selection

Since there are a number of anonymized features and we don't know what they actually represent, we will apply a feature-importance technique to check the importance of each feature and its effect on model training.

Here we can see that Babies has the least effect on the model's prediction, while country has the most.
There are also many anonymized features that affect the model's prediction, so those will be kept for model training

So the top 20 features are selected for the model
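A sketch of the selection using a random forest's `feature_importances_` on tiny synthetic data (the notebook keeps the top 20; here there are only two features, so we just take the top one):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: "signal" fully determines the label, "noise" does not
rng = np.random.RandomState(0)
X = pd.DataFrame({"signal": rng.rand(200), "noise": rng.rand(200)})
y = (X["signal"] > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Rank features by importance and keep the top k (k=20 in the notebook)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False).head(20).index.tolist()
print(top[0])  # signal
```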

Scale the values (Normalization)

Feature scaling or standardization is a data pre-processing step applied to the independent variables or features of the data. It helps normalize the data within a particular range, and it can also speed up the calculations in an algorithm. Normalization is therefore important to scale all data equally for better results.
Looking at the data, we can see that it is not normalized. For scaling we use min-max scaling.
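A sketch of min-max scaling with scikit-learn's `MinMaxScaler` on a hypothetical unscaled feature; the key point is to fit the scaler on the training data and reuse the fitted scaler on the test data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [20.0], [40.0]])  # hypothetical unscaled feature

# Fit on the training data only; the same scaler is reused for test data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```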

Splitting train data into train and test data
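A sketch of the split with `train_test_split` on hypothetical arrays (the split ratio and random state here are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```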

Preprocessing Test Data

Creating a function in order to preprocess:

Models

The method follows three steps:

These are the functions that we call in the simple models and advanced models:

KNN

Parameters:

In This Model:

Logistic Regression

In This Model:

Advanced Models

Random Forest

Parameters:

bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False

In This Model:

MLP Classifier

Parameters:

activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_fun=15000, max_iter=200, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False

In This Model:

Model Evaluation

The confusion matrix and the K-fold cross-validation are included in the Models/Advanced Models part.
If we look at the accuracy difference between training and test in every K-fold, we can see that although there was a little bias toward the training data, overall the models (MLP, SVC, Logistic Regression, etc.) are not overfitting.

Confusion Matrix

The confusion matrix is used to evaluate how much the model has predicted correctly. It combines the true labels and predicted labels and gives its evaluation in four ways:
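A sketch of the four-way evaluation with scikit-learn's `confusion_matrix` on hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = cancelled)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are true labels, columns predictions: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())  # [[2, 1], [1, 2]]
```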

K-Fold Cross Validation and ROC curve/AUC

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
Cross-validation procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
In addition to this method we use ROC curve/AUC in order to evaluate the model quality.
AUC stands for "Area Under the ROC Curve": it measures the entire two-dimensional area underneath the ROC curve.
AUC provides an aggregate measure of performance across all possible classification thresholds. So, we are looking for higher AUC that indicates a better model.
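The two ideas combine naturally in scikit-learn's `cross_val_score` with `scoring="roc_auc"`; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the bookings features and cancellation labels
X, y = make_classification(n_samples=300, random_state=0)

# k=5 folds: each fold serves once as the held-out validation split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print(len(scores))  # 5
```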

Under-fitting and Over-fitting

Under-fitting and over-fitting are two common problems in machine learning; a model usually underperforms due to one of them.
Under-fitting happens when the model is too simple, i.e. it has too few features to train on, or is regularized so much that it cannot learn anything from the dataset, which leads to low variance and too much bias, producing wrong predictions. Over-fitting, on the other hand, occurs when a model is trained so closely to the training data that it eventually fails to provide good predictions on unseen data (test data).
Neither of these issues has a fixed solution, but they can be prevented in a number of ways, which are implemented in our model, i.e.:

Prediction

We chose the "Random Forest" model because it gave us the highest AUC (0.92), which indicates better prediction.
Choosing the Random Forest predictions to submit to our output file:

Other Models

We tried these models while making the project; they were excluded because of their low AUC.
All the code is written in Markdown cells, so it is not part of the workflow.

Naive Bayes

Parameters:
priors=None ,var_smoothing=1e-09

Code:
nb = GaussianNB()
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
fig1 = plt.figure(figsize=[12, 12])
cv = KFold(n_splits=5, random_state=7, shuffle=True)
for train_index, test_index in cv.split(x):
    X_train = x.iloc[train_index]
    X_test = x.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    nb.fit(X_train, y_train)  # Run model
    prediction = nb.predict_proba(X_test)
    fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    print("Training Data Accuracy:", nb.score(X_train, y_train) * 100)
    print("Test Data Accuracy:", nb.score(X_test, y_test) * 100)
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i + 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f)' % (mean_auc), lw=2, alpha=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32, 0.7, 'More accurate area', fontsize=12)
plt.text(0.63, 0.4, 'Less accurate area', fontsize=12)
plt.show()

naive%20bayers.png

In This Model:

Code:
models(GaussianNB(), X_train, X_test, y_train, y_test,test_df)

naive%20bayers2.png

Decision Tree Classifier

Parameters:

ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best'

Code:
dt = DecisionTreeClassifier()
scores = []
y_preds = []
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
fig1 = plt.figure(figsize=[12, 12])

cv = KFold(n_splits=3, random_state=47, shuffle=True)
for train_index, test_index in cv.split(x):
    X_train = x.iloc[train_index]
    X_test = x.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    dt.fit(X_train, y_train)  # Run model
    prediction = dt.predict_proba(X_test)
    fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    scores.append(dt.score(X_test, y_test))
    print("Training Data Accuracy:", dt.score(X_train, y_train) * 100)
    print("Test Data Accuracy:", dt.score(X_test, y_test) * 100)
    y_preds.append(dt.predict(X_test))
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i + 1

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f)' % (mean_auc), lw=2, alpha=1)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32, 0.7, 'More accurate area', fontsize=12)
plt.text(0.63, 0.4, 'Less accurate area', fontsize=12)
plt.show()

tree.png

In This Model:

Code:
models(DecisionTreeClassifier(), X_train, X_test, y_train, y_test,test_df)

tree2.png

SVC

Parameters:

C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False

Code:
svc = SVC(probability=True)
scores = []
y_preds = []
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
fig1 = plt.figure(figsize=[12, 12])

cv = KFold(n_splits=3, random_state=47, shuffle=True)
for train_index, test_index in cv.split(x):
    X_train = x.iloc[train_index]
    X_test = x.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    svc.fit(X_train, y_train)  # Run model
    prediction = svc.predict_proba(X_test)
    fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    scores.append(svc.score(X_test, y_test))
    print("Training Data Accuracy:", svc.score(X_train, y_train) * 100)
    print("Test Data Accuracy:", svc.score(X_test, y_test) * 100)
    y_preds.append(svc.predict(X_test))
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i + 1

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f)' % (mean_auc), lw=2, alpha=1)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32, 0.7, 'More accurate area', fontsize=12)
plt.text(0.63, 0.4, 'Less accurate area', fontsize=12)
plt.show()

svc.png

In This Model:

Code:
models(SVC(probability=True), X_train, X_test, y_train, y_test,test_df)

svc2.png

Note:

All the visualizations and steps are also explained in the report.

i-hope-you-like-it.jpg